Follow-up on "Self-play, Deep Search, and Diminishing Returns"
Abstract
This follow-up on my prior self-play contributions to the ICGA Journal and elsewhere discusses a variety of topics related to self-play research in general and to my own self-play experiment with FRITZ 6 in particular. The text covers the comparability of search depths, the detection of duplicate games, the usage of opening books and other databases, the purpose of calibration matches, the potential influences of transposition tables and of pre-processing at the root node, and, last but not least, several points that are crucial for the statistical analysis of self-play matches.
“Beware of bugs in the above code; I have only proved it correct, not tried it.” (Donald E. Knuth)
1. APOLOGIES
I should have written a follow-up to my self-play publications (Heinz, 2000a, 2000b, 2001a, 2001b, 2001c) much earlier, because many people interested in my work have already waited far too long for answers and comments to their questions. My sincere apologies to each and every one of you! The simple explanation for my silence is that I temporarily lost interest in computer chess and computer game-playing in general during the past 18 months or so. This is also the reason why I have not made more preprints of my self-play publications and the raw game data of my self-play experiments publicly available on the Internet. Apologies again to those still waiting; I promise to have my new personal research pages at http://www.i-u.de/schools/heinz/ up and running as soon as possible. I intend a substantial subset of these pages to become a central repository for data and information about self-play research. So, everybody, please submit your own findings and results to me!
Actually, I am really pleased, and even feel somewhat honoured, by the lively debates and strong reactions that my self-play research has sparked.
Hopefully, the comments and discussions below provide some new insights and food for thought for many readers and not just the self-play experts.
1 International University (IU) in Germany, School of Information Technology, P.O. Box 1550, D-76605 Bruchsal, F. R. Germany. Email: [email protected], WWW: http://www.i-u.de/schools/heinz/.
ICGA Journal, June 2003
2. GENERAL COMMENTS AND DISCUSSION
2.1 Meaning and Comparability of Search Depths
Among the first things everybody learns in computer chess is that the meaning of search depth and the shape of the search trees differ widely from program to program, and may even differ between successive versions of the very same program. So what does searching to a fixed depth actually refer to? In Section 1.1 of Heinz (2001a), I answered this question as follows.
“The ‘fixed depth’ question is not trivial because the modes of operation of the programs differ substantially depending on its real meaning. In the search-theoretical sense, ‘fixed depth’ denotes true brute-force search with uniform path lengths from the root to all horizon nodes and no selectivity at all – neither by means of depth reductions or other kinds of forward pruning nor by any search extensions. In computer-chess practice, however, ‘fixed depth’ usually equals ‘fixed iteration depth’ which relates to the depth limit of iterative deepening as performed by the top-level search control. Here, the programs operate with an iteration limit instead of a time bound but otherwise execute their sophisticated variable-depth search procedure as built in – with all kinds of depth reductions, forward pruning, and search extensions enabled.”
Of course, this has severe implications for comparisons based on search depths in self-play matches involving different programs. Strictly speaking, such search depths are hardly comparable at all.
Hence, it is all the more surprising that self-play experiments in computer chess seem to yield very similar results overall (Heinz, 2001a, 2001c). Maybe the specific characteristics of the different search implementations are levelled out in self-play, because each match pits against each other two versions of the very same program that differ only by some search handicap. Then only the handicap remains visible in the final match results. This might explain why the relative self-play strengths of extremely diverse programs like BELLE (Thompson, 1982), HITECH (Berliner et al., 1990), and FRITZ 6 (Heinz, 2001a, 2001c) resemble each other so much for fixed iteration search depths of 6 to 9 plies.
2.2 Duplicate Games
The previous self-play works by others do not mention duplicate games at all. Yet the presence of duplicates in self-play matches is clearly undesirable because they do not provide any new information about the relative playing strengths of the respective program versions. For the sake of sample independence, it would also be desirable to remove duplicate games prior to any statistical analysis of the match results (see Subsection 3.1). Unfortunately, the detection of duplicates turns out to be especially tricky in the context of self-play, because exact duplicates are not the only ones that ought to be weeded out. Essentially, all games that feature duplicate “tails”, with little real play after the last successful look-up in the opening book or associated opening databases, mostly engage the book engines rather than the search engines of the opponents. As the book engines are identical in self-play, such “soft” duplicates hardly test the relative playing strengths of two program versions that actually differ by some search handicap. Detecting soft duplicates, however, requires considerable effort, which I did not deem worthwhile.
In order not to introduce any artificial bias by removing only the easy-to-spot duplicates, I finally left all duplicate games in the matches as they originally occurred.
2.3 Opening Books and Other Databases
Self-play research could easily avoid duplicate games in matches (see above) by starting play from a set of balanced positions with all opening books disabled. The main problem is that assembling a sufficiently large number of such positions seems elusive at best. GM John Nunn started such an endeavour some time ago but has come up with only tens of suitable candidates (the so-called “Nunn Positions”) rather than the necessary hundreds or thousands. So far, the best possible set-up remains playing matches with a wide, well-balanced opening book and random move selection from it enabled. Repeating each random opening choice with colours reversed further adds to the fairness of matches. As self-play tries to exercise search handicaps in particular, other persistent position caches such as endgame databases and learning files should normally be disabled too. It might still be interesting, though, to investigate in more detail how these databases influence self-play results at progressing search depths.
2.4 Calibration Matches
Opening books and chess programs themselves may exhibit consistently better play for a specific colour (i.e., either Black or White). This is widely known and mostly results from unbalanced data (e.g., a book contains good continuations for White only) and coding bugs (e.g., an important search extension is triggered for Black only). In order to make sure that a self-play experiment does not merely measure some idiosyncrasies related to skewed Black / White behaviour of the base program, it seems like a good idea to execute a “calibration match” between two identical program versions without any search handicap.
The expected result of such a calibration match between two identical opponents playing opposite sides is, of course, a roughly even score, possibly somewhat tilted in favour of White due to its advantage of moving first in chess. If the actual score of the calibration match deviates substantially from 50 percent with good statistical confidence, then some unwanted Black / White skew definitely plagues the opening book and / or the chess program itself. FRITZ 6 at a fixed iteration depth of 8 plies using the wide general opening book passed the calibration test with a score of 52.3 percent for White (924 wins, 1288 draws, and 788 losses out of 3000 games) in my own self-play experiment (Heinz, 2000b, 2001a).
2.5 Transposition Tables
Although it might not seem obvious at first, all program versions involved in a self-play match must use transposition tables of the same size. This is a typical example of keeping as many factors as possible constant in an experiment. It also holds for fixed-depth matches, because hits in the transposition table near the horizon tend to increase the effective look-ahead significantly even for searches to nominally fixed depths. Hence, larger transposition tables always represent an unfair advantage for any program version. Whether the shallower or the deeper searching versions benefit more from them is not intuitively clear. Additionally, the influence of the table size on self-play results at progressing search depths is an interesting, yet still open question.
2.6 Pre-processing at Root Node
Several people raised strong concerns about the extensive pre-processing at the root node allegedly done by FRITZ 6 and attributed the diminishing returns visible in my self-play experiment mainly to this factor. Strangely enough, nobody seemed to care about the pre-processing issue in connection with all the other previously published self-play results.
In particular, HITECH and LOTECH as used by Berliner et al. (1990) are well known for being completely dependent on aggressive pre-processing at the root node. The effect thereof, however, was never discussed or questioned anywhere. This is not to say that I do not take the concerns about pre-processing seriously or deem them void. Quite to the contrary, I was fully aware of the pre-processing problem from the very beginning! For exactly this reason, I already stated the following at the end of Section 6 in Heinz (2001a) to settle the issue regarding FRITZ 6.
“Last but not least, we like to mention that FRITZ 6 does only very limited pre-processing at the root node (according to personal communication with its author, Frans Morsch). Hence, the effects of diminishing returns observed in our experiment are not simply caused by extensive pre-processing. Schubert’s (2000) recent report in a German computer-chess periodical provides further evidence for this. After our publication of some intermediate results (Heinz, 2000b), Schubert conducted his own self-play experiments with HIARCS 7.32 and an enhanced version of FRITZ 6 which pre-processes even less than the original release version used by us. HIARCS 7.32 is known to be a slow and knowledge-intensive chess program that hardly relies on any preprocessing but rather expends much effort to analyze each node during the search in great detail. Diminishing returns were visible for both FRITZ 6 and HIARCS 7.32 in Schubert’s experiment.”
3. COMMENTS AND DISCUSSION ON THE STATISTICAL ANALYSES
3.1 Removal of Duplicate Games
A trivially obvious point of criticism is the issue of sample independence raised by the potential presence of duplicate games in self-play matches. Many readers justly brought this up against the statistics employed in my publications.
However, several people then took the argument a step further for a rather quick and unjust all-out dismissal of the statistical analyses along the following line of reasoning: “duplicate games ⇒ samples in matches not independent ⇒ standard statistics not applicable ⇒ calculations incorrect ⇒ whole work meaningless.”
Rebutting the above, let me now explain why I deem this reasoning not only overly simplistic but plainly wrong. The error springs from a common misconception about the applicability of standard statistics. They actually work fine as long as the set of samples is representative of the whole population and not biased towards consistently over- or underestimating the measured feature. The normal way to avoid such bias involves random sampling with independent observations, leading to a set of independent samples without doubles. Yet although sample independence provides a sufficient condition for unbiased sample sets, it is not a necessary one! Hence, sets of dependent samples may still be unbiased and thus allow for the application of standard statistics.
Depth       w      s(w)               90%-C Δw                        95%-C Δw
                   Binomial  W/D/L    Binomial        W/D/L           Binomial        W/D/L
6 vs. 5     0.715  0.008     0.007    0.402, 0.456    0.408, 0.450    0.397, 0.461    0.403, 0.455
7 vs. 6     0.725  0.008     0.006    0.424, 0.477    0.431, 0.471    0.419, 0.483    0.427, 0.474
8 vs. 7     0.688  0.008     0.006    0.347, 0.403    0.355, 0.396    0.342, 0.409    0.351, 0.399
9 vs. 8     0.683  0.008     0.006    0.339, 0.395    0.347, 0.387    0.334, 0.400    0.343, 0.391
10 vs. 9    0.659  0.009     0.006    0.290, 0.347    0.299, 0.338    0.284, 0.352    0.295, 0.341
11 vs. 10   0.629  0.009     0.006    0.229, 0.287    0.238, 0.277    0.223, 0.292    0.234, 0.281
12 vs. 11   0.618  0.009     0.006    0.207, 0.266    0.217, 0.256    0.202, 0.271    0.214, 0.260
[ w]        –      –         –
Table 1: Extended statistical analysis of FRITZ 6 self-play results.
In the case of self-play matches, you measure the score and use the overall scoring rate as a point estimate for the real winning probability.
Now, a few duplicates covering the full range of possible outcomes (win, draw, loss) hardly influence the overall scoring rate of a match comprising hundreds or thousands of games. Therefore, a relatively small number of duplicates per match does not cause any significant bias in the sample, and otherwise unbiased matches remain so even after adding some duplicate games. This is exactly the situation that occurs when playing matches with a random move selection algorithm drawing from a substantial opening book. My own self-play experiment did just that, and all the others certainly did something very similar. In brief, the statistical analyses of my self-play publications remain valid even without the removal of duplicate games.
[Remark: I never brought up the issue of sample independence related to duplicate games explicitly in my publications because the previous self-play research listed hardly any details on game duplicates either. Given that sample dependence is such an obvious point of criticism which the published literature failed to address for other self-play experiments, I simply deemed the argument settled along the lines of reasoning discussed above. Moreover, as seen in Subsection 2.2, the meaningful identification of duplicate games is a much trickier business than it first appears.]
3.2 Analysis with Exact W / D / L Numbers
I did not include the exact W / D / L numbers (wins, draws, losses) in the statistical analyses of my own self-play experiment (Heinz, 2000b, 2001a) because the results would then no longer have been comparable with my former detailed analyses of previous self-play experiments by others (Heinz, 2000a, 2001b). Furthermore, as Table 1 shows, the exact W / D / L analysis of my self-play experiment does not yield any significant new insights or conclusions.
The table extends my already published statistical analyses (Heinz, 2000b, 2001a) by taking the exact W / D / L numbers of each match into account and showing the results of the corresponding W / D / L calculations alongside the less exact results of the binomial calculations. For the sake of completeness and for the benefit of ICGA Journal readers, Appendix 6 briefly explains the fundamentals and notation of the statistical analyses employed in my publications. Together with Table 1, it completes the dissemination of information about my self-play experiment in this journal.
The exact W / D / L numbers are easily incorporated into the statistical analysis by simply replacing the binomial standard error from Appendix 6 with the trinomial W / D / L standard deviation given below.
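The formula itself is not reproduced in the text above. As a minimal sketch, assuming that each game is scored 1, 1/2 or 0 and that the W / D / L standard error is the ordinary standard error of that per-game score, the two variants can be compared as follows. The function names and the normal quantile 1.96 for a 95% interval are assumptions made for illustration, not taken from the original analysis. The interval helper works on the difference 2w - 1 because the interval columns of Table 1 are centred on that value rather than on w itself.

```python
import math


def score_rate(wins, draws, losses):
    """Scoring rate w: a win counts 1, a draw 1/2, a loss 0."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games


def binomial_se(w, games):
    """Binomial standard error of w, which ignores the draw count."""
    return math.sqrt(w * (1.0 - w) / games)


def wdl_se(wins, draws, losses):
    """Standard error of w from exact W / D / L counts (assumed form)."""
    games = wins + draws + losses
    w = score_rate(wins, draws, losses)
    # Mean of the squared per-game scores 1, 1/4 and 0.
    mean_square = (wins + 0.25 * draws) / games
    return math.sqrt((mean_square - w * w) / games)


def difference_interval(w, se, z):
    """Normal-approximation interval for 2w - 1, the margin over the opponent."""
    centre = 2.0 * w - 1.0
    half_width = 2.0 * z * se
    return centre - half_width, centre + half_width


# Calibration match from Section 2.4: 924 wins, 1288 draws, 788 losses for White.
W, D, L = 924, 1288, 788
w = score_rate(W, D, L)
se_bin = binomial_se(w, W + D + L)
se_wdl = wdl_se(W, D, L)
low, high = difference_interval(w, se_wdl, 1.96)  # 1.96 is an assumed quantile
print(f"w = {w:.3f}, binomial s(w) = {se_bin:.4f}, W/D/L s(w) = {se_wdl:.4f}")
print(f"95% interval for 2w - 1: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the W / D / L error can never exceed the binomial one, and it is strictly smaller whenever a match contains draws. This agrees with the pattern of the two s(w) columns in Table 1.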
Journal title:
- ICGA Journal
Volume 26, Issue
Pages -
Publication date: 2003